Automated unsupervised authorship analysis using evidence accumulation clustering
Authors
Abstract
Authorship Analysis aims to extract information about the authorship of documents from features within those documents. Typically, this is performed as a classification task with the aim of identifying the author of a document, given a set of documents of known authorship. Alternatively, unsupervised methods have been developed primarily as visualisation tools to assist the manual discovery of clusters of authorship within a corpus by analysts. However, there is a need in many fields for more sophisticated unsupervised methods to automate the discovery, profiling and organisation of related information through clustering of documents by authorship. An automated and unsupervised methodology for clustering documents by authorship is proposed in this paper. The methodology is named NUANCE, for n-gram Unsupervised Automated Natural Cluster Ensemble. Testing indicates that the derived clusters have a strong correlation to the true authorship of unseen documents.
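The abstract does not describe NUANCE's internals, so the following is only a generic sketch of evidence accumulation clustering over character n-gram profiles, not the authors' implementation. The function names, the k-means base clusterer, the trigram size, and the fixed 0.5 threshold are all assumptions chosen for illustration.

```python
import random
from collections import Counter


def ngram_profile(text, n=3):
    """Relative frequencies of the character n-grams in a text."""
    grams = [text[i:i + n] for i in range(len(text) - n + 1)]
    total = max(len(grams), 1)  # avoid division by zero on very short texts
    return {g: c / total for g, c in Counter(grams).items()}


def distance(p, q):
    """Squared Euclidean distance between two sparse profiles."""
    keys = sorted(set(p) | set(q))  # sorted so the sum is order-stable
    return sum((p.get(g, 0.0) - q.get(g, 0.0)) ** 2 for g in keys)


def mean_profile(profiles):
    """Element-wise mean of several sparse profiles."""
    total = {}
    for p in profiles:
        for g, v in p.items():
            total[g] = total.get(g, 0.0) + v
    return {g: v / len(profiles) for g, v in total.items()}


def kmeans(profiles, k, rng, iterations=10):
    """Plain k-means used as the base clusterer (an assumption)."""
    centroids = rng.sample(profiles, k)
    labels = [0] * len(profiles)
    for _ in range(iterations):
        labels = [min(range(k), key=lambda c: distance(p, centroids[c]))
                  for p in profiles]
        for c in range(k):
            members = [p for p, l in zip(profiles, labels) if l == c]
            if members:  # keep the old centroid if a cluster empties
                centroids[c] = mean_profile(members)
    return labels


def co_association(profiles, runs=20, k_range=(2, 3), seed=0):
    """Fraction of base clusterings in which each pair shares a cluster."""
    rng = random.Random(seed)
    n = len(profiles)
    counts = [[0] * n for _ in range(n)]
    for _ in range(runs):
        k = rng.randint(k_range[0], k_range[1])
        labels = kmeans(profiles, k, rng)
        for i in range(n):
            for j in range(n):
                if labels[i] == labels[j]:
                    counts[i][j] += 1
    return [[c / runs for c in row] for row in counts]


def consensus_clusters(coassoc, threshold=0.5):
    """Link pairs whose co-association exceeds the threshold and return
    connected components (equivalent to a single-link cut)."""
    n = len(coassoc)
    labels = [-1] * n
    current = 0
    for start in range(n):
        if labels[start] != -1:
            continue
        labels[start] = current
        stack = [start]
        while stack:
            i = stack.pop()
            for j in range(n):
                if labels[j] == -1 and coassoc[i][j] > threshold:
                    labels[j] = current
                    stack.append(j)
        current += 1
    return labels
```

Calling `consensus_clusters(co_association([ngram_profile(d) for d in docs]))` on a list of strings `docs` (with at least as many documents as the largest `k`) returns one cluster label per document. The number of clusters is not fixed in advance; it emerges from how often the base clusterings agree.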
Similar resources
Efficient Unsupervised Authorship Clustering Using Impostor Similarity
Some real-world authorship analysis applications require techniques that scale to thousands of documents with little or no a priori information about the number of candidate authors. While there is extensive research on identifying authors given a small set of candidates and ample training data, almost none is based on real-world applications of clustering documents by authorship, independent o...
Clustering by Authorship Within and Across Documents
The vast majority of previous studies in authorship attribution assume the existence of documents (or parts of documents) labeled by authorship to be used as training instances in either closed-set or open-set attribution. However, in several applications it is not easy or even possible to find such labeled data and it is necessary to build unsupervised attribution models that are able to estim...
Vote/Veto Classification, Ensemble Clustering and Sequence Classification for Author Identification
The Author Identification task for PAN 2012 consisted of three different sub-tasks: traditional authorship attribution, authorship clustering and sexual predator identification. We developed three machine learning approaches for these tasks. For the two authorship related tasks we created various sets of feature spaces, where individual differences in writing styles are assumed to surface in ju...
Overview of the Author Identification Task at PAN-2017: Style Breach Detection and Author Clustering
Several authorship analysis tasks require the decomposition of a multiauthored text into its authorial components. In this regard two basic prerequisites need to be addressed: (1) style breach detection, i.e., the segmenting of a text into stylistically homogeneous parts, and (2) author clustering, i.e., the grouping of paragraph-length texts by authorship. In the current edition of PAN we focu...
Comparison Between Unsupervised and Supervise Fuzzy Clustering Method in Interactive Mode to Obtain the Best Result for Extract Subtle Patterns from Seismic Facies Maps
Pattern recognition on seismic data is a useful technique for generating seismic facies maps that capture changes in the geological depositional setting. Seismic facies analysis can be performed using the supervised and unsupervised pattern recognition methods. Each of these methods has its own advantages and disadvantages. In this paper, we compared and evaluated the capability of two unsuperv...
Journal:
- Natural Language Engineering
Volume 19, Issue
Pages -
Publication date: 2013